Query2Vec: NLP Meets Databases for Generalized Workload Analytics

Authors

  • Shrainik Jain
  • Bill Howe
Abstract

We consider methods for learning vector representations of SQL queries to support generalized workload analytics tasks, including workload summarization for index selection and predicting queries that will trigger memory errors. We consider vector representations of both raw SQL text and optimized query plans, and evaluate these methods on synthetic and real SQL workloads. We find that general algorithms based on vector representations can outperform existing approaches that rely on specialized features. For index recommendation, we cluster the vector representations to compress large workloads with no loss in performance from the recommended index. For error prediction, we train a classifier over learned vectors that can automatically relate subtle syntactic patterns with specific errors raised during query execution. Surprisingly, we also find that these methods enable transfer learning, where a model trained on one SQL corpus can be applied to an unrelated corpus and still enable good performance. We find that these general approaches, when trained on a large corpus of SQL queries, provide a robust foundation for a variety of workload analysis tasks and database features, without requiring application-specific feature engineering.

PVLDB Reference Format: Shrainik Jain, Bill Howe, Jiaqi Yan, and Thierry Cruanes. Query2Vec: An Evaluation of NLP Techniques for Generalized Workload Analytics. PVLDB, 11 (5): xxxx-yyyy, 2018. DOI: https://doi.org/TBD

Related resources

Query Clustering using Segment Specific Context Embeddings

This paper presents a novel query clustering approach to capture the broad interest areas of users querying search engines. We make use of recent advances in NLP word2vec and extend it to get query2vec, vector representations of queries, based on query contexts, obtained from the top search results for the query and use a highly scalable Divide & Merge clustering algorithm on top of the query v...

WiSeDB: A Learning-based Workload Management Advisor for Cloud Databases

Workload management for cloud databases deals with the tasks of resource provisioning, query placement, and query scheduling in a manner that meets the application’s performance goals while minimizing the cost of using cloud resources. Existing solutions have approached these three challenges in isolation while aiming to optimize a single performance metric. In this paper, we introduce WiSeDB, ...

Generalized Snapshot Isolation and a Prefix-Consistent Implementation

Generalized snapshot isolation extends snapshot isolation as used in Oracle and other databases in a manner suitable for replicated databases. While (conventional) snapshot isolation requires that transactions observe the “latest” snapshot of the database, generalized snapshot isolation allows the use of “older” snapshots, facilitating a replicated implementation. We show that many of the desir...

All Schedules Dynamic

This paper presents the Affected Set Priority Ceiling (ASPC) concurrency control protocol for real-time object-oriented databases. The protocol is based on a combination of a semantic locking technique and priority ceiling techniques. The paper specifies six criteria for real-time concurrency control: high concurrency, deadlock prevention, predictability, temporal consistency enforcement, logical...

In-Memory Data Analytics on Coupled CPU-GPU Architectures

In the big data era, in-memory data analytics is an effective means of achieving high performance data processing and realizing the value of data in a timely manner. Efforts in this direction have been spent on various aspects, including in-memory algorithmic designs and system optimizations. In this paper, we propose to develop the next-generation in-memory relational database processing techn...

Journal title:
  • CoRR

Volume abs/1801.05613  Issue 

Pages  -

Publication date 2018